We use two datasets:
https://www.kaggle.com/camnugent/sandp500 file: all_stocks_5yrs.csv
This dataset contains all stocks in the S&P 500 index, and their historical prices for the past 5 years
In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import datetime as dt
from math import sqrt
In [2]:
data = pd.read_csv('./all_stocks_5yr.csv')
In [3]:
data.shape
Out[3]:
In [4]:
data.head()
Out[4]:
For each trading day we have the ticker symbol (Name) and the price at opening (Open), closing (Close), highest (High) and lowest (Low) of the day and the number of stocks traded (Volume).
What are the types of the columns?
In [5]:
data.dtypes
Out[5]:
We will only use the Date, Close and Name features for this exercise.
Convert the Date column to datetime
In [6]:
data['Date'] = pd.to_datetime(data['Date'])
What is the time period over which we have date?
In [7]:
data.Date.min(), data.Date.max()
Out[7]:
Split the total period in 5 periods of approx 1 year each
In [8]:
periods = np.array_split(data.Date.unique(), 5)
In [9]:
for yr in range(5):
print(periods[yr].min(), periods[yr].max())
Create a multi-index for better retrieval later
In [10]:
data.set_index(['Name', 'Date'], inplace=True)
data.head()
Out[10]:
Are there any null values?
In [11]:
pd.isnull(data['Close']).any()
Out[11]:
How much data points do we have?
The list of stock symbols:
In [12]:
symbols = list(data.index.levels[0])
symbols[:10]
Out[12]:
In [13]:
symbols[-10:]
Out[13]:
How many symbols do we have?
In [14]:
len(symbols)
Out[14]:
Facebook (ticker symbol 'FB') is one of these superstar companies of the past 5 year. Plot the price evolution of facebook
In [15]:
plt.gcf().set_size_inches(16, 6)
plt.subplot(121)
data.loc['FB'].Close.plot()
plt.title('Price evolution Facebook over 5 years')
plt.subplot(122)
data.loc['FB'].loc[periods[4]].Close.plot()
plt.title('Price evolution Facebook over the last year')
plt.show()
We can get similar data as those from Kaggle, via the webservice of Yahoo finance The pandas_datareader (https://github.com/pydata/pandas-datareader) module allows easy downloading of stock data in a pandas DataFrame
Let's dowload the S&P 500 index (ticker='^GSPC') historical data.
In [16]:
import pandas_datareader as pdr
sp500_index = pdr.get_data_yahoo('^GSPC')
sp500_index.head()
Out[16]:
What is the date range of the data?
In [17]:
sp500_index.index.min(), sp500_index.index.max()
Out[17]:
Align the date range with the data from Kaggle
In [18]:
sp500_index = sp500_index.loc[data.index.min()[1]:data.index.max()[1],:]
join the sp500 index data with the kaggle data
In [19]:
sp500_index['Name'] = '^GSPC'
sp500_index['Date'] = sp500_index.index
sp500_index.set_index(['Name', 'Date'], inplace=True)
data = pd.concat([data, sp500_index])
In [20]:
symbols.append('^GSPC')
symbols[-10:]
Out[20]:
Facebook is one of those stocks that outperformed the index. Plot the historical return of 'FB' (Facebook) against the S&P 500 index
In [21]:
sp500_return = data.loc['^GSPC'].Close / data.loc['^GSPC'].Close[0]
fb_return = data.loc['FB'].Close / data.loc['FB'].Close[0]
sp500_return.plot(label='SP500')
fb_return.plot(label='FB')
plt.legend()
plt.title('5 year return Facebook vs S&P 500 index')
plt.show()
For the return we perform a linear regressions over the time series period. With the linear trendline we avoid daily fluctuations, and get a general idea of the return. We define the return as the ratio of the price at the end date over the price of the start date (regression values).
We define the risk as the standard deviation of the real prices vs the trendline, relative over the start price (regression value).
For the return-risk-ratio we divide the return by the risk. For alternative return-risk-ratio definition see for example the Sharpe ratio
In [22]:
def return_risk_ratio(ticker, start, end):
"""Return a tuple: (relative return over period, risk over the period, return/risk)"""
# select the data for the ticker and time period
data_masked = data.loc[ticker].loc[start:end]
x = np.arange(data_masked.Close.size)
y = np.array(data_masked.Close)
# Fit a degree 1 regression line
fit = np.polyfit(x, y, 1)
fit_fn = np.poly1d(fit)
y_regression = fit_fn(x)
# Calculate Root Mean Square Error = beta = risk
RMSE = sqrt(mean_squared_error(y_true = y, y_pred = y_regression))
slope = fit_fn.coeffs[0]
y0_regression = fit_fn(x[0])
return_percent = slope / y0_regression * x.size * 100 # return in percent
RMSE_percent = RMSE / y0_regression * 100
return (return_percent, RMSE_percent , return_percent / RMSE_percent)
Get the start and end dates of the last year of data
In [23]:
start_date_iso = dt.date.isoformat(pd.Timestamp(periods[4].min()))
end_date_iso = dt.date.isoformat(pd.Timestamp(periods[4].max()))
print(start_date_iso, end_date_iso)
Check return / risk ratio for Facebook
In [24]:
print(list(return_risk_ratio('FB', start_date_iso, end_date_iso)))
# print(return_risk_ratio('^GSPC', start_date, end_date))
Calculate return / risk ratios for all S&P 500 symbols and store the results in a Dataframe
In [25]:
# create an empty Dataframe
rrr_index = symbols
rrr_columns = ['Period', 'Symbol', 'Return', 'Risk', 'ReturnRiskRatio', 'RiskReturnToIndex']
rrr = pd.DataFrame(columns=rrr_columns)
In [26]:
# Fill the Dataframe with the return risk ratios
n = 0
for i, p in enumerate(periods):
start_date_iso = dt.date.isoformat(pd.Timestamp(p.min()))
end_date_iso = dt.date.isoformat(pd.Timestamp(p.max()))
for s in symbols:
#print(i, s, start_date_iso, end_date_iso)
if data.loc[s].loc[start_date_iso:end_date_iso].size > 0: # data available for that period
ret, risk, ratio = return_risk_ratio(s, start_date_iso, end_date_iso)
rrr.loc[n] = [i, s, ret, risk, ratio, np.NaN]
n = n + 1
rrr.head()
Out[26]:
Set a multi index for easier data retrieval
In [27]:
rrr.set_index(['Period', 'Symbol'], inplace=True)
rrr.head()
Out[27]:
Check the data for Facebook and the SP500 index
In [28]:
rrr.loc[0].loc['FB']
Out[28]:
In [29]:
rrr.loc[0].loc['^GSPC']
Out[29]:
In [30]:
plt.gcf().set_size_inches(16, 16)
subplot = 320
for y in rrr.index.levels[0]:
subplot += 1
plt.subplot(subplot)
rrry = rrr.loc[y]
plt.scatter(rrry.Risk, rrry.Return, alpha=.2, label='individual stock')
plt.scatter(rrry.loc['^GSPC'].Risk, rrry.loc['^GSPC'].Return, c='r', label='Index')
fit = np.polyfit(rrry.Risk, rrry.Return, deg=1)
plt.plot(rrry.Risk, fit[0] * rrry.Risk + fit[1], color='red', label='regression')
plt.xlabel('Risk')
plt.ylabel('Return %')
title = "year {} Return vs Risk".format(y)
plt.title(title)
plt.legend()
plt.show()
In [31]:
for y in rrr.index.levels[0]:
for s in symbols:
if s in rrr.loc[y].index: # data available for that period
rrr.loc[y].loc[s].RiskReturnToIndex = \
rrr.loc[y].loc[s].ReturnRiskRatio / rrr.loc[y].loc['^GSPC'].ReturnRiskRatio
rrr.head()
#rrr.nlargest(20, 'RiskReturnToIndex')
Out[31]:
In [32]:
rrr.nlargest(10, 'RiskReturnToIndex')
Out[32]:
In [33]:
best_performers = rrr[rrr.RiskReturnToIndex > 1]
len(best_performers)
Out[33]:
So only about 20% of the stocks performs better than the SP500 index!
In [34]:
plt.gcf().set_size_inches(16, 16)
subplot = 320
for y in rrr.index.levels[0]:
subplot += 1
ax = plt.subplot(subplot)
ax.set_xlim([0, 20]) # fix risk and return scale to for all subplots
ax.set_ylim([0, 200])
rrry = rrr.loc[y]
best_performersy = best_performers.loc[y]
plt.scatter(best_performersy.Risk, best_performersy.Return, alpha=.2, label='high performing stock')
plt.scatter(rrry.loc['^GSPC'].Risk, rrry.loc['^GSPC'].Return, c='r', label='Index')
plt.xlabel('Risk')
plt.ylabel('Return %')
title = "year {} Return vs Risk, Return(sp500) = {:5.2f}%".format(y, rrry.loc['^GSPC'].Return)
plt.title(title)
plt.legend()
plt.show()
In [35]:
best_stocks = best_performers.index.get_level_values(1)
from collections import Counter
d = Counter(best_stocks)
d = {k: v for k, v in d.items() if v > 2}
print(len(d))
d.keys()
Out[35]:
In [36]:
best_performers.loc[4]
#rrr.loc[4].nlargest(10, 'RiskReturnToIndex')
Out[36]:
In [37]:
bestInYear = list(rrr.loc[4].nlargest(20, 'RiskReturnToIndex').index)
bestInYear
Out[37]:
In [38]:
plt.gcf().set_size_inches(12, 8)
rrry = rrr.loc[4]
best4 = best_performers.loc[4]
plt.scatter(rrry.Risk, rrry.Return, alpha=.2, label='individual stock')
plt.scatter(rrry.loc['^GSPC'].Risk, rrry.loc['^GSPC'].Return, c='r', label='Index')
fit = np.polyfit(rrry.Risk, rrry.Return, deg=1)
plt.plot(rrry.Risk, fit[0] * rrry.Risk + fit[1], color='red', label='regression')
plt.scatter(best4.Risk, best4.Return, color='green', label='best of year')
plt.xlabel('Risk')
plt.ylabel('Return %')
title = "last year Return vs Risk".format(y)
plt.title(title)
plt.legend()
plt.show()
There is no assurance that past performance is an accurate predictor of the future!